The wonderful world of white wines! The sugary drink that almost anybody could love. In this summary, I will compare some key properties of white wines to figure out what makes them so good. So get ready, and let's dive in.
## [1] 4898 13
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## 7 7 6.2 0.32 0.16 7.0 0.045
## 8 8 7.0 0.27 0.36 20.7 0.045
## 9 9 6.3 0.30 0.34 1.6 0.049
## 10 10 8.1 0.22 0.43 1.5 0.044
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## 7 30 136 0.9949 3.18 0.47 9.6
## 8 45 170 1.0010 3.00 0.45 8.8
## 9 14 132 0.9940 3.30 0.49 9.5
## 10 28 129 0.9938 3.22 0.45 11.0
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
## 7 6
## 8 6
## 9 6
## 10 6
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Above, you will see the dimensions, the first 10 rows, and a summary of each column. Column X runs from 1 to 4898, which is the number of observations we have. Because of this, I made the X column a factor, to label each individual wine.
With the data laid out like this, I can see the values and columns but cannot really visualize any information. Next I move on to some univariate plots so I can visualize and analyze the data further.
In our dataset, the “quality” variable ranges between 3 and 9 with a median of 6, so there are neither very bad nor truly excellent wines, but mostly average ones. Also, only 25 wines are rated either 3 or 9.
The basic histogram shows that fixed acidity has very few values at the low end (the minimum is 3.8) and a long tail above 10, so I limit the x-axis range. Changing the bin width also shows more clearly that the majority of fixed acidity values fall between 5.5 and 8.5.
After adjusting the bin width, I can see that most wines have a volatile acidity (acetic acid) between 0.15 and 0.4 g/l, with a median of 0.26 g/l and a mean of 0.28 g/l.
The majority of citric acid levels fall between 0.15 and 0.5 g/l, with a spike at 0.49 g/l. In contrast to volatile acidity, citric acid adds freshness to the wine.
Most wines have an amount of sodium chloride between 0.025 and 0.06 g/l, with a median of 0.043 g/l. The highest level in this dataset is 0.346 g/l.
The median value of free sulfur dioxide is 34 mg/l. It has a wide range, from 2 to 289 mg/l, with the majority of the values falling between 10 and 55 mg/l. Since free sulfur dioxide becomes noticeable at 50 mg/l, I assume it will affect the taste.
Similar to free sulfur dioxide, total sulfur dioxide also has a wide range, from 9 to 440 mg/l, with a median value of 134 mg/l.
Density falls in a very narrow range, mostly between 0.985 and 1.005 (the full range in the summary is 0.9871 to 1.0390).
The pH is between 2.72 and 3.82, which means that the wines are on the acidic side.
An appropriate level of alcohol enhances the flavor, but a high level of alcohol causes an unpleasant burning sensation. Our white wine dataset doesn’t appear to contain very high alcohol levels, though. The median is 10.4% and the majority of values fall between 9% and 13%.
Residual sugar has a wide range, from 0.6 to 65.8 g/l, while the median is only 5.2 g/l. This is presumably because wine producers try to cater to consumers’ varying preferences for sweetness. Some people, like me, favor sweet wines, while others might prefer bone dry.
Summary of Residual Sugar
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Summary of Residual Sugar Log10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2218 0.2304 0.7160 0.6432 0.9956 1.8180
Summary of Residual Sugar Square Root
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7746 1.3040 2.2800 2.3200 3.1460 8.1120
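The two transformed summaries can be cross-checked against the raw one. Both log10 and the square root are increasing functions, so the minimum, quartiles and maximum carry over directly (up to rounding), while the mean does not. A minimal sketch in R, using only the values printed above; the variable names are illustrative and not taken from the original analysis code:

```r
# Five-number summary of residual sugar (g/l), copied from the output above
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

sugar_log10 <- log10(sugar)  # strong compression of the long right tail
sugar_sqrt  <- sqrt(sugar)   # milder compression than log10

print(round(sugar_log10, 4))
print(round(sugar_sqrt, 4))

# The mean does not carry over: log10 of the raw mean is not the mean of the logs
print(log10(6.391))
```

The third summary above starts at 0.7746, which is the square root of 0.6, so it describes the square root of residual sugar rather than its square.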
There are 4,898 different white wines in the data, with 11 physicochemical features that may each affect the quality of the wine. In this study, I will look at the factors that interest me most to find out whether they have any effect on quality. Quality has been converted to a factor with levels from 3 to 9, with 9 being the highest quality.
One thing I have observed already is that the residual sugar of these wines is lower than I expected for white wines: half of the wines in the list have 5.2 g/l or less. Also, the pH seems to follow a roughly normal distribution.
The main feature of this dataset is the quality of the wine. All of the analysis revolves around quality, with certain areas of interest such as the pH, residual sugar, and the acidity.
With all the data plotted like this, it is easy to form quick hypotheses about the data and see how the variables relate to one another. Let's take a closer look at a few of these variables!
This is a great plot for showing the relationship between residual sugar and density. As you can see, density increases with residual sugar, indicating a positive correlation between the two.
Free sulfur dioxide compared to total sulfur dioxide has an upward trend. This is expected, but removing outliers gives a cleaner plot.
As noted previously, higher density goes along with higher residual sugar. Note here that higher density also comes with a lower alcohol percentage.
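The direction of these two relationships can be illustrated with a correlation coefficient. The sketch below uses only the ten rows printed at the top of this report, so the magnitudes are not representative of the full 4,898 wines; it only demonstrates the calculation and the signs. The data frame name `wine_head` is an assumption made for this example.

```r
# First ten rows of the data, copied from the head() output above
wine_head <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956,
                     0.9951, 0.9949, 1.0010, 0.9940, 0.9938),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0)
)

# Pearson correlations; the sign gives the direction of each relationship
r_sugar_density   <- cor(wine_head$residual.sugar, wine_head$density)
r_density_alcohol <- cor(wine_head$density, wine_head$alcohol)

print(c(sugar_density = r_sugar_density, density_alcohol = r_density_alcohol))
```

On these rows the first coefficient is positive and the second is negative, which matches the direction seen in the scatter plots.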
Alcohol vs. quality is a very interesting relationship when viewed as a boxplot. Interestingly enough, the alcohol percentage generally increases with the quality of the wine. There is a dip around quality level 5, which may be related to the bulk of the wines being rated around the middle of the scale.
## (7.99,9.24] (9.24,10.5] (10.5,11.7] (11.7,13] (13,14.2]
## 845 1730 1390 795 138
##
## Descriptive statistics by group
## group: (7.99,9.24]
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 845 0.28 0.1 0.27 0.27 0.07 0.1 0.82 0.71 1.47 3.2 0
## --------------------------------------------------------
## group: (9.24,10.5]
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1730 0.28 0.1 0.26 0.27 0.09 0.08 1 0.92 1.7 5.87 0
## --------------------------------------------------------
## group: (10.5,11.7]
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1390 0.26 0.09 0.24 0.25 0.07 0.09 0.96 0.88 1.79 6.76
## se
## X1 0
## --------------------------------------------------------
## group: (11.7,13]
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 795 0.3 0.1 0.29 0.29 0.07 0.08 1.1 1.02 1.45 6.37 0
## --------------------------------------------------------
## group: (13,14.2]
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 138 0.37 0.12 0.35 0.36 0.1 0.15 0.78 0.64 0.87 1.18
## se
## X1 0.01
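The bucket labels in the output above are consistent with splitting alcohol into five equal-width bins over its observed range (8 to 14.2). A minimal sketch of that kind of binning in base R follows. The nine wines are a hypothetical mini-sample invented for illustration (only the endpoints 8 and 14.2 come from the summary), and `alcohol.bucket` is an assumed name, not necessarily the one used in the original analysis.

```r
# Hypothetical mini-sample; 8 and 14.2 are the min and max alcohol from the summary,
# all other values (including every volatile acidity value) are made up
alcohol          <- c(8.0, 8.8, 9.5, 9.9, 10.1, 11.0, 12.5, 13.4, 14.2)
volatile.acidity <- c(0.27, 0.27, 0.30, 0.23, 0.28, 0.22, 0.29, 0.35, 0.37)

# Five equal-width bins over the observed range of alcohol
alcohol.bucket <- cut(alcohol, breaks = 5)

# Number of wines per bin
print(table(alcohol.bucket))

# Median volatile acidity within each bin
bucket_medians <- tapply(volatile.acidity, alcohol.bucket, median)
print(bucket_medians)
```

Equal-width bins keep the boundaries easy to read, at the cost of uneven group sizes, which is why the highest alcohol group in the output above holds only 138 wines.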
There are a lot of interesting relationships in the data, such as alcohol vs. quality. It is very interesting that the quality of the wine actually increases as the amount of alcohol increases. There are also interesting relationships between residual sugar and density, and between volatile acidity and alcohol.
Let's dig deeper into these relationships and explore some multivariate analysis to see where this takes us in the wine exploration.
This chart is very interesting and descriptive in many ways. It uses the quality of the wine as the color and builds the histogram on that. It seems that higher residual sugar means lower quality in most circumstances.
As you can see in the above two plots, volatile acidity and sulfur dioxide respond to the alcohol level in opposite ways: the higher the alcohol content, the higher the volatile acidity but the lower the sulfur dioxide.
## Low Alcohol\nn = 8 Mid. Alcohol\nn = 12 High Alcohol\nn = 8
## 2086 2154 658
When I break down quality by alcohol level and volatile acidity, the negative relationship between volatile acidity and quality is strongest in the 7.99-10.1% alcohol group. For example, within this group, the median volatile acidity decreases from 0.34 g/l for less desirable wines (quality = 4) to 0.19 g/l for highly rated ones (quality = 8); the former group also has a higher variation in volatile acidity (sd = 0.31) than the latter (sd = 0.03).
##
## Descriptive statistics by group
## group: 3
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 20 10.35 1.22 10.45 10.34 1.19 8 12.6 4.6 0.02 -0.83
## se
## X1 0.27
## --------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 163 10.15 1 10.1 10.08 1.04 8.4 13.5 5.1 0.7 0.15 0.08
## --------------------------------------------------------
## group: 5
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1457 9.81 0.85 9.5 9.71 0.74 8 13.6 5.6 1.08 1.07
## se
## X1 0.02
## --------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 2198 10.58 1.15 10.5 10.52 1.33 8.5 14 5.5 0.4 -0.72
## se
## X1 0.02
## --------------------------------------------------------
## group: 7
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 880 11.37 1.25 11.4 11.42 1.33 8.6 14.2 5.6 -0.3 -0.56
## se
## X1 0.04
## --------------------------------------------------------
## group: 8
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 175 11.64 1.28 12 11.78 1.19 8.5 14 5.5 -0.89 0.01
## se
## X1 0.1
## --------------------------------------------------------
## group: 9
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 5 12.18 1.01 12.5 12.18 0.3 10.4 12.9 2.5 -0.98 -1.03
## se
## X1 0.45
The comparison of wine quality vs. alcohol level is probably the one of highest priority to wine makers. There are very few wines of high quality; the majority are around quality level 6. But a higher quality also comes with a higher mean alcohol level and a lower density. Alcohol level has a relatively small range, from 8% to 14.2%, with a median value of 10.4%. The majority of our wine ratings fall between 5 and 7. Except for the lowest rating categories (3 and 4), probably due to their relatively small sample sizes, a better-rated wine has a higher alcohol level (the left chart).
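The grouped statistics above can be turned into two quick checks: what share of the wines is rated 5 to 7, and where the median alcohol level rises or falls from one rating to the next. This is a small sketch in R that uses only the counts and medians printed in the grouped output; the variable names are illustrative.

```r
# Per-rating counts and median alcohol (%), copied from the grouped output above
quality        <- 3:9
n              <- c(20, 163, 1457, 2198, 880, 175, 5)
median_alcohol <- c(10.45, 10.1, 9.5, 10.5, 11.4, 12, 12.5)

# Share of all wines rated 5, 6 or 7
share_5_to_7 <- sum(n[quality %in% 5:7]) / sum(n)
print(round(share_5_to_7, 3))

# Change in median alcohol from each rating to the next;
# negative steps mark where the upward trend is broken
steps <- setNames(
  diff(median_alcohol),
  paste(quality[-length(quality)], quality[-1], sep = "->")
)
print(steps)
```

Roughly 93% of the wines are rated 5 to 7, and the only negative steps are 3->4 and 4->5, so the median alcohol level rises steadily from rating 5 upward.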
##
## Descriptive statistics by group
## group: (7.99,9.24]
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 845 41.51 15.79 42 41.34 16.31 5 128 123 0.32 0.93
## se
## X1 0.54
## --------------------------------------------------------
## group: (9.24,10.5]
## vars n mean sd median trimmed mad min max range skew
## X1 1 1730 37.54 18.09 36 36.72 19.27 3 138.5 135.5 0.6
## kurtosis se
## X1 0.78 0.44
## --------------------------------------------------------
## group: (10.5,11.7]
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1390 32.28 17.19 31 31.09 14.83 2 289 287 3.34 38.23
## se
## X1 0.46
## --------------------------------------------------------
## group: (11.7,13]
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 795 30.44 12.74 30 29.96 11.86 3 96 93 0.77 2.48
## se
## X1 0.45
## --------------------------------------------------------
## group: (13,14.2]
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 138 27.88 12.14 28 27.35 13.34 3 65 62 0.4 -0.07
## se
## X1 1.03
The two combined charts below plot alcohol against free sulfur dioxide and volatile acidity. The higher the alcohol level, the lower the free sulfur dioxide. For example, the median free sulfur dioxide in the 13-14.2% alcohol group is only 28 mg/l, much less than the 42 mg/l in the 7.99-9.24% alcohol group. The relationship between volatile acidity and alcohol becomes clearer in the higher alcohol groups. For instance, the median volatile acidity in the 13-14.2% alcohol group is 0.35 g/l, compared to 0.24 g/l in the 10.5-11.7% alcohol group.
This chart is very interesting and descriptive in many ways. It shows the distribution of log10 residual sugar, which forms a roughly bimodal histogram. It also uses the quality of the wine as the color and builds the histogram on that. It seems that higher residual sugar means lower quality in most circumstances.
This white wine dataset is the tidiest one I’ve ever used. However, I was frustrated in the beginning because, except for alcohol, almost none of the input variables has a strong relationship with wine quality. Reading the correlation matrix is not enough. When conditioning on other relevant variables, the relationships between the physicochemical properties and quality became clear. Also, all input variables are continuous, which limited the types of graphs I could make. One solution I used was to recode some of them as categorical variables.
The other problem I had was that my knowledge of the physicochemical properties and how they interact was limited before starting this project. I had to resort to additional reading to brush up on my wine knowledge.
This dataset is pretty limited, with only 13 variables (technically 12 can be used for analysis because one of them is an ID variable). It would be great if other variables such as grape type and wine age could be included for further investigation.